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Abstract 

Regularized risk minimization with the binary hinge loss and its variants lies at the 
heart of many machine learning problems. Bundle methods for regularized risk 
minimization (BMRM) and the closely related S VMStruct are considered the best 
general purpose solvers to tackle this problem. It was recently shown that BMRM 
requires 0(1/ e) iterations to converge to an e accurate solution. In the first part of 
the paper we use the Hadamard matrix to construct a regularized risk minimization 
problem and show that these rates cannot be improved. We then show how one can 
exploit the structure of the objective function to devise an algorithm for the binary 
hinge loss which converges to an e accurate solution in 0(1/ ^/e) iterations. 

1 Introduction 

Let X; e X C R d denote samples and y; e y be the corresponding labels. Given a training set of n 
sample label pairs {(xj, yi)}?—-^, drawn i.i.d. from a joint probability distribution on X x y, many 
machine learning algorithms solve the following regularized risk minimization problem: 

1 -J 1 

min J(w) := AO(w) + i? C m P (w), where i? emp (w) := - V" Ifc, y*; w). (1) 
w n z — ' 

i=i 

Here 2(x,,yi;w) denotes the loss on instance (x,,yj) using the current model w and -R C mp(w), 
the empirical risk, is the average loss on the training set. The regularizer fi(w) acts as a penalty 
on the complexity of the classifier and prevents overfitting. Usually the loss is convex in w but can 
be nonsmooth while the regularizer is usually a smooth strongly convex function. Binary Support 
Vector Machines (SVMs) are a prototypical example of such regularized risk minimization problems 
where y = {1,-1} and the loss considered is the binary hinge loss: 

l(xi,yi;w) = [1 - yi (w,Xj)] + , with [-] + :=max(0, •)■ (2) 

Recently, a number of solvers have been proposed for the regularized risk minimization problem. 
The first and perhaps the best known solver is SVMStruct [1], which was shown to converge in 
0(l/e 2 ) iterations to an e accurate solution. The convergence analysis of SVMStruct was improved 
to 0(1/ e) iterations by [2], In fact, [2] showed that their convergence analysis holds for a more 
general solver than SVMStruct namely BMRM (Bundle method for regularized risk minimization). 

At every iteration BMRM replaces i? cmp by a piecewise linear lower bound R c t p and optimizes 

min Jf (w) := Af2(w) + i?j P (w), where R^ p (w) := max (w, a,) + (3) 

w Ki<t 
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to obtain the next iterate w t . Here € dR emp (wi_i) denotes an arbitrary subgradient of R emp 
at Wj_i (see Section 2) and 6, = R emp (wi-i) — (wj_i,aj). The piecewise linear lower bound is 
successively tightened until the gap 

£t := ™'<t J( - W *'' ) ~ J *( w *)' (4) 
falls below a predefined tolerance e. 

Even though BMRM solves an expensive optimization problem at every iteration, the convergence 
analysis only uses a simple one-dimensional line search to bound the decrease in e t . Furthermore, 
the empirical convergence behavior of BMRM is much better than the theoretically predicted rates 
on a number of real life problems. It was therefore conjectured that the rates of convergence of 
BMRM could be improved. In this paper we answer this question in the negative by explicitly con- 
structing a regularized risk minimization problem for which BMRM takes at least 0(1/ e) iterations. 

One possible way to circumvent the 0(1/ e) lower bound is to solve the problem in the dual. Using 
a very old result of Nesterov [3] we obtain an algorithm for SVMs which only requires 0(1/ y/e) 
iterations to converge to an e accurate solution; each iteration of the algorithm requires 0(nd) work. 
Although we primarily focus on the regularized risk minimization with the binary hinge loss, our 
algorithm can also be used whenever the empirical risk is piecewise linear and contains a small 
number of pieces. Examples of this include multiclass, multi-label, and ordinal regression hinge 
loss and other related losses. 



2 Preliminaries 

In this paper, lower bold case letters {e.g., w, /x) denote vectors, Wi denotes the i-th component of w 
and At refers to the k dimensional simplex. Unless specified otherwise, ||-|| refers to the Euclidean 

norm ||w|| := 

(E"=i w 1) 2 -R:=lU {oo}, and [t] := {1, . . . , t}. The dom of a convex function 
F is defined by dom F := {w : F(w) < oo}. The following three notions will be used extensively: 

Definition 1 A convex function F : R n — > R is strongly convex (s.c.) wrt norm || • || if there exists a 
constant a > such that F — ^|| • || 2 is convex, a is called the modulus of strong convexity of F, 
and for brevity we will call F a -strongly convex or a-s.c. 

Definition 2 A function F is said to have Lipschitz continuous gradient (leg) if there exists a 
constant L such that 

||VF(w) - VF(w')|| <L||w-w'|| VwWw'. (5) 
For brevity, we will call F L-l.c.g.. 

Definition 3 The Fenchel dual of a function F : Ei — > E 2 , is a function F* : E% — > E* given by 

F*(w*)= sup {(w,w*> -F(w)} (6) 

The following theorem specifies the relationship between strong convexity of a primal function and 
Lipschitz continuity of the gradient of its Fenchel dual. 

Theorem 4 ([4, Theorem 4.2.1 and 4.2.2]) 

1. IfF : R n -» 1 is a-strongly convex, then dom F* = M™ and VF* is —Leg. 

2. IfF : R n — > R is convex and L-l.c.g, then F* is —strongly convex. 

Subgradients generalize the concept of gradients to nonsmooth functions. For w e dom F, (i is 
called a subgradient of F at w if 

F(w') > F(w) + (w' - w, fi) Vw'. (7) 

The set of all subgradients at w is called the subdifferential, denoted by dF(w). If F is convex, 
then dF(w) ^ for all w e dom F, and is a singleton if, and only if, F is differentiable [4]. 
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Any piecewise linear convex function F(w) with t linear pieces can be written as 

F(w) =max{(a ! ,w) +b t }, (8) 
»e[t] 

for some and 6j. If the empirical risk i? omp is a piecewise linear function then the convex opti- 
mization problem in (1) can be expressed as 

minJ(w) := minmax{(aj, w) + 6j} + Af2(w). (9) 

W W »£[(] 

Let A = [ai . . . a t ], then the adjoint form of J(w) can be written as 

D(a) := -Aft*(-A _1 Aa) + (a, b) with aeA ( (10) 
where the primal and the adjoint optimum are related by 

w* = 0fi*(-A _1 Aa*) (11) 
In fact, using concepts of strong duality (see e.g. Theorem 2 of [5]), it can be shown that 

inf |max(a t ,w) + b t + Xil(w)\ = sup {-AO^-A^ 1 Aa) + (a, b) } (12) 

wGR d l»e[n] J aeA t 

3 Lower Bounds 

The following result was shown by [2]: 

Theorem 5 (Theorem 4 of [2]) Assume that J(w) > Ofor all w, and that ||<9 w -Remp(w)|| < G for 
all w £ W, where W is some domain of interest containing all w t < for t' < t. Also assume that fl* 
has bounded curvature, i.e. let ||<9^Q*(/z)|| < H* for all /x e { — X^ 1 Act where a. € A t }. Then, 
for any e < AG 2 H* /A we have e t < e after at most 



AJ(0) 8G 2 H* A 

l0g w + ^- 4 (13) 



steps. 



Although the above theorem proves an upper bound of 0(1/ e) on the number of iterations, the 
tightness of this bound has been an open question. We now demonstrate a function which satisfies 
all the conditions of the above theorem, and yet takes iterations to converge. 

To construct our lower bounds we make use of the Hadamard matrix. Annxn Hadamard matrix is 
an orthogonal matrix with {±1} elements which is recursively defined for d = 2 k (for some k): 

Note that all rows of the Hadamard matrix Hd are orthogonal and have Euclidean norm Vd. Con- 
sider the following dx d orthonormal matrix 

A . = J_( H d/2 -H d/2 
V ~ H <i/2 H d/2 

whose columns a^ are orthogonal and have Euclidean norm 1, which is used to define the following 
piecewise quadratic function: 

J(w)=max(a t ,w) + ^||w|| 2 . (14) 



R Afi(w) 

Theorem 6 The function J(w) defined in (14) satisfies all the conditions of Theorem 5. For any 
t < | we have e t > 
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Proof By construction of A, we have maxjg^] (a i; w) > 0. Furthermore, -|||w|| 2 > 0. Together 
this implies that J(w) > 0. Furthermore, <9 w 7? omp (w) are the columns of A, which implies that 
||9 w -R e mp(w)|| < 1. Since we set O(-) = ±|| • || 2 it follows that £}*(•) = ±|| • || 2 . Therefore 
||<9 2 0*(-)|| = i. Hence, (14) satisfies all the conditions of Theorem 5. 

Let Wo, Wi . . . w t denote the solution produced by BMRM after t iterations, and let a io , a il . . . a it 
denote the corresponding subgradients. Then 

J t (w) = max(a i3 , w) + ^||w|| 2 with w t = argmin J t (w). (15) 
If we define A t = [a^ ] with j e [t] then 

Jt(w t ) = min max(a I w) + ^||w|| 2 = min max w T A t c* + ^-||w|| 2 

wer j£[t] V J ' 2 weR d aeA* 2 

= max min w T A 4 a + ^||w|| 2 ( See Appendix A for details) (16) 

aeA' weR d 2 

= max — ^-a T A7A t o; = — min ^-ct T A7A t a. 

aeA' 2A * aeA* 2A * 

Since the columns of A are orthonormal, it follows that Aj A = I t where l t is the t x t dimensional 
identity matrix. Thus 

J t (w t ) = -\ min ||a:|| 2 = 
n tJ 2A aeA* 11 2Xt 

Combining this with J(w t >) > for t' £ [t] and recalling the definition of e t from (4) completes 
the proof. ■ 

In fact, s t is a proxy for the primal gap 

S t = min J(wt') — J(w*). 
*'e[t] 

Since J(w*) is unknown, it is replaced by Jt(w t ) to obtain e t - Since J(w*) > Jt(w t ), it follows 
that et > 5t [5]. We now show that Theorem 6 holds even if we replace e t by S t . 

Theorem 7 Under the same assumptions as Theorem 6, for any t < | we have S t > j^. 

Proof Note that J*(w) is minimized by setting a = je, where e denotes the t dimensional vector 
of ones. Recalling that £!*(•) = i|| • || 2 and using (11) one can write w t = — j^A t e. Since 
the columns of A are orthonormal, it follows that (ai,Wt) is — ^ if a; is a column of A t and 
otherwise. Therefore, max ig [ d ] (a^, w t ) = 0. On the other hand, by noting that Aj A = I t we 
obtain |||w t || 2 = ^j. Plugging into (14) yields J(w t ) = ^ and hence min t / e[t ] J(w t ') = sk- 
it remains to note that J(w) > 0, while J(0) = 0. Therefore J(w*) = J(0) = 0. ■ 



4 A new algorithm with convergence rates 0(1/ ^/e ) 

We now turn our attention to the regularized risk minimization with the binary hinge loss, and 
propose a new algorithm. Our algorithm is based on [3] and [6] which proposed a non-trivial scheme 
of minimizing an L-l.c.g function to e-precision in 0(1 j \fe) iterations. Our contributions are two 
fold. First, we show that the dual of the regularized risk minimization problem is indeed a L- 
l.c.g function. Second, we introduce an 0(n) time algorithm for projecting onto an n-dimensional 
simplex or in general an n-dimensional box with a single linear equality constraint, thus improving 
upon the 0(n log n) deterministic algorithm of [7] (which also gives a randomized algorithm having 
expected complexity 0{n)). This projection is repeatedly invoked as a subroutine by Nesterov's 
algorithm when specialized to our problem. 

Consider the problem of minimizing a function J(w) with the following structure over a closed 
convex set Qi: 

J(w)=/(w)+ 5 *(Aw). (17) 
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Algorithm 1 Pragam: an 0(1/ k 2 ) rate primal-adjoint solver. 

Input: L as a conservative estimate of (i.e., no less than) the Lipschitz constant of Vf(a). 
Output: Two sequences and atk which reduce the duality gap at 0(1/ k 2 ) rate. 

1: Initialize: Randomly pick a _i in Q 2 . Let /x = 2L, c* <— v(at-i), w <— w(a_i). 

2: for fc = 0,1, 2,... do 

3: Let r fe = /3 fe <- (1 - r fe )a fe + na^ (w fe ). 

4: Set w fe+ i <— (1 - r fe )wfe + r fc w(/3 fc ), <- w(/3fc), ^fc+i <- (1 - T fc )^ fc . 

5: end for 



Here / is strongly convex on Qi, A is a linear operator which maps Qi to another closed convex 
set Q 2 , and g is convex and Leg on Q 2 . [6] works with the adjoint form of J: 

D(a) --. 9 (a)-r(-A T a), (18) 

which is Z.c.g according to Theorem 4. Under some mild constraint qualifications which we omit for 
the sake of brevity (see e.g. Theorem 3.3.5 of [8]) we have 

J(w) > D(a) and inf J(w) = sup D(a). (19) 

By using the algorithm in [3] to maximize D(a) one can obtain an algorithm which converges to an 
e accurate solution of J(w) in 0(1/ y/e) iterations. 

The regularized risk minimization with the binary hinge loss can be identified with (17) by setting 

A 1 " 

JM = o !|w|| 2 + min -£[1- ^« x *' w ) + & )]+ ( 20 ) 

/(w) " " ' 

ff * (Aw) 

The latter, 5*, is the dual of g(a) = -J2i a i ( see Appendix C). Here Qi = R d . Let A := YX T 
where Y := diag(yi, . . . , y n ), X := [xi, . . . , x„]. Then the adjoint can be written as : 

D(a) := -g(a) - f*(-A T a) = V on - ^-a T YX T XYa with (21) 

i 

Q 2 = jaG [O^- 1 ]" i^^a^oj. (22) 

In fact, this is the well known SVM dual objective function with the bias incorporated. 

Now we present the algorithm of [6] in Algorithm 1. Since it optimizes the primal J(w) and the 
adjoint D(a) simultaneously, we call it Pragam (PRimal- Adjoint GAp Minimization). It requires 
a (72-strongly convex prox-function on Q 2 : d 2 (ot) = ^.||a|| 2 , and sets D 2 — max Q , e g 2 d 2 (a). Let 
the Lipschitz constant of \7D(a) be L. Algorithm 1 is based on two mappings a M (w) : Qi 1— > Q 2 
and w(a) : Q 2 1— » Qi, together with an auxiliary mapping w : Q 2 1— » Q 2 . They are defined by 

a M (w) := argmin / ud 2 (a) — (Aw, a) + 5(a) = argmin — ||a|| 2 + w T XYa — a,, (23) 
w(a) := argmin (Aw, a) + /(w) = argmin -w T XYo + — ||w|| 2 = -XYa, (24) 

wGQi w£K d 2 A 

L 2 

u(a) := argmin — \\a — a\\ — (VD(a), a — a) . (25) 

a'£Q 2 2 

Equations (23) and (25) are examples of a box constrained QP with a single equality constraint. In 
the appendix, we provide a linear time algorithm to find the minimizer of such a QP. The overall 
complexity of each iteration is thus 0(nd) due to the gradient calculation in (25) and the matrix 
multiplication in (24). 
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dataset 


n 


d 


s(%) 


dataset 


n 


d 


s(%) 


dataset 


n 


d 


s(%) 


adult9 

astro-ph 

aut-avn 


32,561 
62,369 
56,862 


123 

99,757 
20,707 


11.28 
0.077 
0.25 


covertype 

news20 

real-sim 


522,911 

15,960 

57,763 


6,274,932 
7,264,867 
2,969,737 


22.22 
0.033 
0.25 


reuters-cl 1 
reuters-ccat 
web8 


23,149 
23,149 
45,546 


1,757,801 
1,757,801 
579,586 


0.16 
0.16 
4.24 



Table 1: Dataset statistics, n: #examples, d: #features, s: feature density. 



4.1 Convergence Rates 

According to [6], on running Algorithm Pragam for k iterations, the a.^ and satisfy: 

ALD 9 

ForSVMs, L=\ ||A|| 2 2 where ||A|| 12 = max{(Aw,a) : ||a|| = 1, ||w|| = 1}, a 2 = I, D 2 = 
j-. Assuming ||xj|| < R, 

|(Aw, a)| 2 < ||«|| 2 ||YX T w|| 2 = ||X T wf = J> t T w) 2 < £ ||w|| 2 ||x 4 || 2 < nR\ 

i i 

Thus by (26), we conclude 

, x , x 4£L> 2 2R 2 ( R \ 

j(Wfc) — D(a k ) < — — — < -— — < e, which gives k > — ■== . 

V ' V ' ~ (fc + l)(fc + 2)a 2 " A(fc + l)(fc + 2) B 



It should be noted that our algorithm has a better dependence on A compared to other state-of-the- 
art SVM solvers like Pegasos [9], SVM-Perf [10], and BMRM [5] which have a factor of ± in their 
convergence rates. Our rate of convergence is also data dependent, showing how the correlation of 
the dataset YX = (j/ixi, . . . , y„x„) affects the rate via the Lipschitz constant L, which is equal 
to the square of the maximum singular value of YX (or the maximum eigenvalue of YXX T Y). 
On one extreme, if x^ is the i-th dimensional unit vector then L = 1, while L = n if all y^Xi are 
identical. 



4.2 Structured Data 

It is noteworthy that applying Pragam to structured data is straightforward. Due to space con- 
straints, we present the details in Appendix E. A key interesting problem there is how to project 
onto a probability simplex such that the image decomposes according to a graphical model. 



5 Experimental Results 

In this section, we compare the empirical performance of our Pragam with state-of-the-art binary 
linear SVM solvers, including liblinear 1 [11], pegasos 2 [9], and BMRM 3 [5]. 

Datasets Table 1 lists the statistics of the dataset. adult 9, astro-ph, news20, real-sim, 
reuters-cl 1, reuters-ccat are from the same source as in [11]. aut-avn classifies docu- 
ments on auto and aviation (http://www.cs.umass.edu/^mccallum/data/sraa.tar.gz). covertype is 
from UCI repository. We did not normalize the feature vectors and no bias was used. 

Algorithms Closest to Pragam in spirit is the line search BMRM (ls-bmrm) which minimizes 
the current piecewise lower bound of regularized i? e m P via a one dimensional line search between 
the current w t and the last subgradient. This simple update was enough for [2] to prove the 1 je 
rate of convergence. Interpreted in the adjoint form, this update corresponds to coordinate descent 
with the coordinate being chosen by the Gauss-Southwell rule [12]. In contrast, Pragam performs 
a parallel update of all coordinates in each iteration and achieves faster convergence rate. So in this 
section, our main focus is to show that Pragam converges faster than ls-bmrm. 

1 http://www.csie.ntu.edu.tw/~cjlin/liblinear 
2 http://ttic. uchicago.edu/~shai/code/pegasos.tgz 
3 http://users.rsise. anu.edu.au/~chteo/BMRM.html 
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Figure 1: Primal function error versus number of iterations. 
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Figure 2: Primal function error versus time. 



We also present the results of liblinear and pegasos. liblinear is a dual coordinate descent optimizer 
for linear SVMs. pegasos is a primal estimated sub-gradient solver for SVM with LI hinge loss. 
We tested two extreme variants of pegasos: pegasos-ri where all the training examples are used 
in each iteration, and pegasos-1 where only one randomly chosen example is used. Finally, we 
also compare with the qp-bmrm proposed in [5] which solves the full QP in (10) in each iteration. 

It should be noted that SVM st ™ d [1] is also a general purpose regularized risk minimizer, and when 
specialized to binary SVMs, the SVMPerf [10, 13] gave the first linear time algorithm for training 
linear SVMs. We did not compare with SVMPerf [10] because its cutting plane nature is very similar 
to BMRM when specialized to binary linear SVMs. 

For Pragam, since the Lipschitz constant L of the gradient of the SVM dual is unknown in prac- 
tice, we resort to [14] which automatically estimates L while the rates presented in Section 4.1 are 
unchanged. We further implemented Pragam-b, the Pragam algorithm which uses SVM bias. In 
this case the inner optimization is a QP with box constraints and a single linear equality constraint. 

For all datasets, we obtained the best A E {2~ 20 , . . . , 2°} using their corresponding validation sets, 
and the chosen A's are given in Appendix D. 

Results Due to lack of space, the figures of the detailed results are available in the Appendix D, 
and the main text only presents the results on three datasets. 

We first compared how fast err t := min t / <t J(w t >) — J(w*) decreases with respect to the iteration 
index t. We used err f instead of J(w t ) — J(w*) because J(w 4 ) in pegasos and Is-bmrm fluctuates 
drastically in some datasets. The results in Figure 1 show Pragam converges faster than Is-bmrm 
and pegasos-n which both have 1/e rates, liblinear converges much faster than the rest algorithms, 
and qp-bmrm is also fast, pegasos-1 is not included because it converges very slowly in terms of 
iterations. 

Next, we compared in Figure 2 how fast err f decreases in wall clock time. Pragam is not fast in 
decreasing err t to low accuracies like 10~ 3 . But it becomes quite competitive when higher accuracy 
is desired, whereas Is-bmrm and pegasos-1 often take a long time in this case. Again, liblinear is 
much faster than the other algorithms. 

Another evaluation is on how fast a solver finds a model with reasonable accuracy. At iteration t, 
we examined the test accuracy of w t > where t 1 :— argmin t / <t J(wf), and the result is presented 
in Figures 3 and 4 with respect to number of iterations and time respectively. It can be seen that 
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Figure 3: Test accuracy versus number of iterations. 
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Figure 4: Test accuracy versus time. 



although Pragam manages to minimize the primal function fast, its generalization power is not 
improved efficiently. This is probably because this generalization performance hinges on the sparsity 
of the solution (or number of support vectors, [15]), and compared with all the other algorithms 
Pragam does not achieve any sparsity in the process of optimization. Asymptotically, all the solvers 
achieve very similar testing accuracy. 

Since the objective function of Pragam-b has a different feasible region than other optimizers which 
do not use bias, we only compared its test accuracy. In Figures 3 and 4, the test accuracy of the 
optimal solution found by Pragam-b is always higher than or similar to that of the other solvers. 
In most cases, Pragam-b achieves the same test accuracy faster than Pragam both in number of 
iterations and time. 



6 Discussion and Conclusions 

In this paper we described a new lower bound for the number of iterations required by BMRM and 
similar algorithms which are widely used solvers for the regularized risk minimization problem. This 
shows that the iteration bounds shown for these solvers is optimum. Our lower bounds are somewhat 
surprising because the empirical performance of these solvers indicates that they converge linearly 
to an e accurate solution on a large number of datasets. Perhaps a more refined analysis is needed to 
explain this behavior. 

The SVM problem has received significant research attention recently. For instance, [9] proposed a 
stochastic subgradient algorithm Pegasos. The convergence of Pegasos is analyzed in a stochastic 
setting and it was shown that it converges in 0(1 /e) iterations. We believe that our lower bounds 
can be extended to any arbitrary subgradient based solvers in the primal including Pegasos. This is 
part of ongoing research. 

Our technique of solving the dual optimization problem is not new. A number of solvers including 
SVM-Light [16] and SMO [17] work on the dual problem. Even though linear convergence is 
established for these solvers, their rates have n- 3 dependence which renders the analysis unusable 
for practical purposes. Other possible approaches include the interior-point method of [18] which 
costs 0(nd 2 log(log(l/e))) time and 0(d 2 ) space where d refers to the dimension of the features. 
LibLinear [11] performs coordinate descent in the dual, and has 0(nd log(l/e)) complexity but 
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only after more than 0(n 2 ) steps. Mirror descent algorithms [19] cost 0(nd) per iteration, but their 
convergence rate is 1/e 2 . These rates are prohibitively expensive when n is very large. 

The 0(1/ y/e) rates for the new SVM algorithm we described in this paper has a favorable depen- 
dence on n as well as A. Although our emphasis has been largely theoretical, the empirical experi- 
ments indicate that our solver is competitive with the state of the art. Finding an efficient solver with 
fast rates of convergence and good empirical performance remains a holy grail of optimization for 
machine learning. 
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Appendix 



A Minimax Theorem on Convex spaces 

The reversal of min and max operators in (16) follows from the following theorem. (Theorem 3 in 
[20]) 

Theorem 8 Let X be a convex space, Y a Hausdorff compact convex space and f : X x Y — > Z a 
function. Suppose that 

• there is a subset U C Z such that a,b £ f(X x Y) with a < b implies U fl (a, b) ^ 0; 

• f(x, •) is lower semicontinuous (l.s.c.) on Y and {y G Y : f(x, y) < s} is convex for each 
x £ X and s E U and 

• f(-,y) is upper semicontinuous (u.s.c.) on X and {x € X : f(x,y) < s} is convex for 
each y € Y and s E U 

Then 

max min f(x,y) — min max f(x,y) 

It is trivial to show that our setting satisfies the above three conditions for the linear form /(w, a) = 
w T A t a. 

In our case Z = R. Since a E A t and the columns of A have Euclidean norm 1, it can be easily 
shown that w is bounded in a ball of radius 1/A. Thus the range space is a finite subset of R and 
we choose U to be the entire range space so that it will satisfy the first condition as /(w, a) is a 
continuous function. 

Also g(a) — /(w, a) and h(w) = /(w, a.) are continuous functions in a and w respectively and 
are thus by definition lower (upper) semicontinuous in a (w). The convexity of the sets in the 2 nd 
and 3 rd condition follows from first principles using definition of convexity. Thus we can use the 
minimax theorem to obtain (16). 

B A linear time algorithm for a box constrained diagonal QP with a single 
linear equality constraint 

It can be shown that the dual optimization problem D(a) from (20) can be reduced into a box 
constrained QP with a single linear equality constraint. 

In this section, we focus on the following simple QP: 

1 n 

min-^d-(a 4 - m 4 ) 2 
Z i=i 

s.t. li <cti < Ui \/i G [n]; 

n 

^^Giai = z. 

Without loss of generality, we assume k < Ui and di ^ for all i. Also assume cr,; ^ because 
otherwise on can be solved independently. To make the feasible region nonempty, we also assume 

^2 ai(5(ai > 0)k + 5{a l < 0)ui) <z<^2 cr t ((5(cr 4 > 0)u t + <5(ct, < 0)^). 

i i 

The algorithm we describe below stems from [21] and finds the exact optimal solution in 0(n) time, 
faster than the 0(n log n) complexity in [7]. 
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With a simple change of variable fa = (Ji{cti — mj), the problem is simplified as 



l sr^i2ni ,i _ f Vi(k ~ mi) if a, > 

mm 2 2- di/3i * ~ 1 - ™i) if o-i < ' 

..<. /lifts.; «e[» 1; w„„e , 

n y 

We derive its dual via the standard Lagrangian. 

L = \T.^ -Y.pt ^ - +Eft(A - «o - a (j> - z'] . 

i i i \ i / 

Taking derivative: 
<9L 

— =%0i-pt + P T -A = / 8 i = ^- 2 (p+-pr + A). (27) 

Substituting into L, we get the dual optimization problem 

min#(A, p+,p7) = 1 £ dfipf - pT + A) 2 - E Pi A + E P^< ~ Xz ' 

i i i 

s.t. p+ > 0, pT > Vi e [n]. 
Taking derivative of D with respect to A, we get: 

E rf r 2 (p+-pr+A)-z' = o. (28) 

The KKT condition gives: 

tf(/3i-i0=0, (29a) 
/ orC9i-«0=0. (29b) 

Now we enumerate four cases. 



1. pi > 0, p i > 0. This implies that l\ — fa — u\, which is contradictory to our assumption. 



2. p+ = 0, p7 = 0. Then by (27), fa = dfx e [/<, u{], hence A e ^«{]. 

3. > 0, p7 = 0. Now by (29) and (27), we have l\ = fa = d~7 2 (p+ + A) > J^ 2 A, hence 
A<d 2 ^andp+=J 2 ^-A. 

4. = 0, p," > 0. Now by (29) and (27), we have u\ = fa = d~7 2 (-pT + A) < d\7 2 \, hence 
A > d 2 v! l and p7 = -d 2 u'; L + A. 

In sum, we have p\ — [dfl^ — A] + and p7 = [A — d 2 u' i ] + . Now (28) turns into 

/(A) := E d; 2 (m - A]+ - [A — rff<] + + A) -z' = 0. (30) 
i v ' 

=:hi(A) 

In other words, we only need to find the root of /(A) in (30). ft-i(A) is plotted in Figure 5. Note that 
hi(X) is a monotonically increasing function of A, so the whole /(A) is monotonically increasing 
in A. Since /(oo) > by z' < J2i u 'i an d f(~°°) < by z' > ^ Z^, the root must exist. 
Considering that / has at most 2n kinks (nonsmooth points) and is linear between two adjacent 
kinks, the simplest idea is to sort {3 2 i l l i ,3 2 i v! i : t S [n]} into s (1 '> < ... < s (2n K If f(s^) and 
/(s ( * +1) ) have different signs, then the root must lie between them and can be easily found because 
/ is linear in [s^ , s^ +1 ^}. This algorithm takes at least 0(n log n) time because of sorting. 
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Figure 5: hi(X) 



Algorithm 2 0(n) algorithm to find the root of /(A). Ignoring boundary condition checks. 

1: Set kink set S <- : i G [n]} U {d?u< : i G [n]}. 

2: while |5| > 2 do 

3: Find median of S: m <- MED(S). 

4: if /(to) >0 then 

5: S <— {x G S 1 : .t < m}. 

6: else 

7: S^{ieS:i>m}. 

8: end if 

9: end while 

10: Return root lf fA~_ u / ( ^ where S = {I, u}. 



However, this complexity can be reduced to 0(n) by making use of the fact that the median of n 
(unsorted) elements can be found in 0(n) time. Notice that due to the monotonicity of /, the median 
of a set S gives exactly the median of function values, i.e., /(MED(5)) = MED({/(x) : x G S}). 
Algorithm 2 sketches the idea of binary search. The while loop terminates in log 2 (2n) iterations 
because the set S is halved in each iteration. And in each iteration, the time complexity is linear to 
| S\, the size of current S. So the total complexity is 0(n). Note the evaluation of /(to) potentially 
involves summing up n terms as in (30). However by some clever aggregation of slope and offset, 
this can be reduced to 0(|5|). 



C Derivation of g* (a) 



To see g*(a) = min^R i ^\ [1 + at — yib] + in (20), it suffices to show that for all a G K": 

P a ) + = miniV [1 + on - y l b] . (31) 

^ — ' has n — ' T 



sup 

pi Q , " - - " 



Posing the latter optimization as: 



in - ^ & s.t. 1 + cti - yib < & > 0. 

h n — » 



mm 

Write out the Lagrangian: 

L = \ + S + ai ~ Vib ~ & - s 

it i 

Taking partial derivatives: 

dL 1 , _ 1n 

^- = --Pi-Pi=0 => Pi S [0, n ], 

dL \ -\ \ -\ 

Q^ = -2^PiVi = o => 2^PiVi = o. 

i i 

Plugging back into L, 

L = ^2p l (l + a l ), s.t. pi G [0, n^ 1 ], ^p iVi = Q. 

i i 

Maximizing L wrt p is exactly the LHS of (31). 
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D Experimental Results in Detail 

The As used in the experiment are: 
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Figure 6: Primal function error versus number of iterations. 
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Figure 7: Primal function error versus time. 
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Figure 8: Test accuracy versus number of iterations. 
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E Structured Output 



In this section, we show how Pragam can be applied to structured output data. When the output 
space y has structures, the label space becomes exponentially large and the problem becomes much 
more expensive. To make the computation tractable, it is common to reparameterize on the cliques 
and to estimate the parameters via graphical model inference. We Below are two examples. 



E.l Margin Scaled Maximum Margin Markov Network 

The maximum margin Markov network formulation (M 3 N) by [22] has the following dual form 
(skipping the primal). Here, a training example i has a graphical model with maximal clique set C ^ . 
For each i and c G ai_ c (y c ) stands for the marginal probability of the clique configuration y c . 
Any possible structured output y can be measured against the given true label y l with a loss function 
£i(y). It is assumed that li(y) decomposes additively onto the cliques: ii(y) = X^cecw ^»,c(yc)- 
In addition, [22] uses a joint feature map fj(y) := f(x*,y), and define Afj(y) := fi(y J ) — fj(y). 
Again, we assume that £j(y) decomposes additively onto the cliques: fj(y) = ^2 c fi,c(yc)> where 
fj,c(yc) : = fc(x% y c ). The simplest joint feature map is defined by: 

(f c (x,y c ),f c /(x,y c /)) = S(y c = y c <)fc(x c , x c ,) = S(y c = y c >) (x c ,x c /) . 

Notice that c and c' are not required to be of the same "type", and S(y c = y c >) will automatically 
filter out incompatible types, e.g., c is an edge and c' is a node. This kernel can be easily vectorized 
into f(x,y) := £) c ( x c (g)(<5(y c = y Ci i), . . . , 6(y c = y c , m(c ))) T ), where m(c) is the number of 
label configurations that clique c can take, and (cross product) is defined by: 

<g> : R s x M* -> R st , (a® b) (i _i) t+j := a t b r 



C 



s.t. 



av(y c )Af*, c (y c ) 



J2a hc (y c 



Y a i,c(yc)^,c(y c 



-,c,y c 



Vi,Vce C w ; 



a ilC (yc)>0 Vi,VceC«,Vy c ; 

a -(yc)= J2 a i,c(yo) V^,V C , C 'eC« :cnc'^0,Vy c 



y c '~ycnc' 



The subroutines of mapping in (23) and (25) can be viewed as projecting a vector onto the probability 
simplex under L2 distance. Moreover, the image now is restricted to the pmf which satisfies the 
conditional independence properties encoded by the graphical model. This is much more difficult 
than projecting onto a box with a linear equality constraint as in SVM, and we can only resort to a 
block coordinate descent as detailed in Appendix E.3. 

E.2 Gaussian Process Sequence Labeling 

[23] proposed using Gaussian process to segment and annotate sequences. It assumes that all train- 
ing sequences have the same length /, and each node can take values in [in). For sequences, the 
maximum cliques are edges: C^' = {(i, t + 1) : t £ [I — 1]}. We use a. to stack up the marginal 
distributions on the cliques: 

|ov(y c ) :i,ce C W ,y c | = {a itt (y t ,y t+ i) :i,te [I - l],y t ,y t +i <E [m]} . 

The marginal probability ai jC (y c ) just aggregates relevant elements in the joint distribution via a 

linear transform. With the joint distribution vector p £ M. nm ' , we can write a — Ap, where A is 
nlm 2 x ran 1 defined by 

\j,t,a,T),(i, y ) = S(i = j A y t = a A y t +i = r). 
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A key difference from the M 3 N is that in Gaussian process, the constraint that the joint density 
lying in the probability simplex is replaced by regularizing the log partition function. So the set 
of marginal distributions on cliques do not have local consistency constraints, i.e., they are free 
variables. The ultimate optimization problem in [23] is an unconstrained optimization: 



min a T Ka. — ^ at T KAe( eiY .) + log ^ exp (ot T K Ae^ y ) . 



(32) 



where K is a kernel on {(x 4 , (yj, y\ + i)) : i,t E [I — 1], yt, Ut+i € [to]}. A simple example is: 

fc((x 4 , (yi,yi +1 )), (x j ,(yi„yi +1 )) =6(t = f ) (s{y\ = y\, A y\ +1 = 

fc(x*,4)<J(j/* = + fc(xj +1 ,4 + i)*(l/j+i = l/t'+l)) • 
If stationality is assumed, we can drop the S(t — t') above and allow node swapping: 

H(*\ (i&yj+i)), (x J , {y J t ,,yi, +1 )) =S(y l t = $ A = 

+ X X fc(xj,,x^(4 = ^). 
pe{t,t+i} ge{t',t'+i} 

The gradient of the first two terms in (32) can be computed with ease. The gradient of the last term 
is 



log ^ cxp (a T KAei t y) 



KE Y [Ae hY ] ■ 



[23] computes this expectation via the forward-backward algorithm. The projections (23) and (25) 
are the same as in Appendix E.l, and Appendix E.3.1 provides a solver for the special case of 
sequence. 



E.3 Efficient projection onto n dimensional simplex factorized by a graphical model 

As a simple extension of Appendix B, we now consider a more involved case. In addition to project- 
ing onto the n dimensional simplex under L 2 distance, we also restrict that the image is factorized 
according to a graphical model. Formally, suppose we have a graphical model over n random vari- 
ables with maximum clique set C. For each clique c, suppose the set of all its possible configuration 
is V c , and the pmf of the marginal distribution on clique c is a c (y c ), where y c € V c . Given a set 
{m c e R |y<:| : c e C}, we want to find a set {a c E R |v; : c e C) which minimizes: 

min \^ d l\\ a c-m c \\l 

C 

s.t. ^a c (y c ) = l VceC; 

a c (y c )>0 VceC,Vy c ; 

X a c(yc)= Myc) Vcnc' ^0,Vy cnc <- 

yc~y cn <:< yc'~y c n c ' 

The last (set of) constraint enforces the local consistency of the marginal distributions, and y c <~ 
y cnc ' means the assignment of clique c matches y cnc ' on the subset c n d . If the graphical model 
is tree structured, then it will also guarantee global consistency. We proceed by writing out the 
standard Lagrangian: 

L =\ d c H( a c(y c ) - m c (y c )) 2 -J2 Xc (J2 a ^ y ^ ~ 1 ) ~ Yl ^(y c )a c (y c ) 

c y c c \ y c / c,y c 

- J2 X Mc, C '(y c nc') X a c(yc)- a c(yc) • 

c,c':cnc'#0y c nc' \yc:yc~y c nc' y c '~ycnc' / 
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Algorithm 3 A coordinate descent scheme for minimizing the dual problem (34). 



Randomly initialize {A c : c} , {£ c (y c ) : c,y c } , {^ c , c '(y c n c ') : c, c',y cnc '}- 



while not yet converged do 

Fixing £ c (y c ), apply conjugate gradient to minimize the unconstrained quadratic form in (34) 
with respect to {A c : c} and c '(y cnc ') : c, c', y c nc'}- The necessary gradients are given 
in (35b) and (35c). 

4: Set£ c (y c )^- [-d 2 m c (y c ) - A c - Y. c > ^c.c'{y c nc')] + for all c e C and y c . 
5: end while 

6: Compute a c (y c ) according to (33). 



Taking derivative over a c (y c ): 
dL 

da ^ j = d 2 c (a c (y c ) - m c (y c )) - A c - £ c (y c ) - E Mc,c'(y c nc') + E Mc',c(y c nc') = 0, 

a c (y c ) = m c (y c ) + d" 2 ^A c + £ c (y c ) + E ^ c > c ' ( ycnc ')^ ' ( 33 ^ 

where /w(y cnc /) := /i C)C - (y cnc ' ) - /V iC (y cnc ')' Plugging it back into L, we derive the dual 
problem: 

min D(X C , £ c (y c ), Mc,c'(y c nc')) =^ E d c 2 E ( Ac + ^ c ( yc ) + E ^ c - c ' ( y cnc) J (34) 

c y c \ c' / 

+ EE mc ( yc ) ( Ac + ^(y c ) + E^ c < c '( ycnc ') ) ~ E Ac 

c y c \ c' / c 

«■*■ Cc(y c ) > o. 

Looking at the problem, it is essentially a QP over A c , £ c (y c ), Hc,c'{y c nc') with the only constraint 
that £c(yc) > 0. Similar to B, one can write £ c (yc) as a hinge function of A c and fj, CtC i (y c nc')- 
However since it is no longer a single variable function, it is very hard to apply the median trick here. 
So we resort to a simple block coordinate descent. The optimization steps are given in Algorithm 3 
with reference to the following expressions of gradient: 



dD 
dD 

8\r. 



3D 



fyic,c'(y c nc') 
From (35a) and £ c (y c ) > 0, we can derive 



-d c 2 (X c + Cc(y c ) + E Mc,c'(ycnc')) + m c(yc) = (35a) 

c' 

dc 2 E ( Ac + ^ c ( yc ) + E Mc,c'(y c nc') I + E mc ( yc ) _ 1 (35b) 

y c \ C / y c 

E ( A c +? c ( y c)+E^( y c,c-)) + E ™c( y c)- ( 35c > 



y 

<C 2 



Cc(y c ) = 



-d 2 c m c (y c ) - X c - Hc,c>(y c nc>) 



(36) 
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E.3.1 Special case: sequence 

Suppose the graph is simply a sequence: x\ — x 2 — ■ ■ ■ — xl and each node can take value in [m]. 
Then the cliques are {(x t ,x t +i) : t £ [L — 1]} and the primal is: 



*=i *,j=i 

s.t. ^a t (i,j) = l Vte[L-l]; 

a*(i,i)>0 Vte[L-l],i,je[m]; 
5^at(i,j) = Y J a t+i{j,k) Vt G [i-2],j G H- 

i A; 

Proceeding with the standard Lagrangian: 

L-l m L-l / \ L-l 

t=i i,j=i t=i \ »j / t=i »,j 

t=l j \ i fc / 

Taking derivative over a t (i,j): 
dL 

a /■ -x = d 2 t (a t (i,j) - m t (i,j)) - \ t - j) ~ Mt(j') + Mt-iW = 

=> at(i,j) = ^r 2 ( A * j) + - Ht-i(i)) +m t (i,j), (37) 

where we define fio(j) := 0. Plugging into i, we derive the dual problem: 

L-l 

2 



1 



(38) 



t=i »,j 

L-l L-l 

+ Y ^ m *( i 'J')( A * + Ct(*,i) + Mt(j) - Mt-i(i)) - ^ A 

t=l i,j t=l 

&(*,j)>0. V*e[L-i],i,je[m]. 



t 



d£ ,_ 2 



= rf t 2 (A t +e t (i,j)+Mtt/)-Mt-i(i))+m t (i,j) = Vt G [L - 1] 

=>tt{i,j) = [-d 2 t m t {i,j) - \ t - Ht{j) + Ht-i{i)]+ 
0^ = rf t " 2 5I(A t + Mt(j) -Mt-i(0) + 5^mt(i, j) - 1 Vt e [L - 1] 

^^=rf- 2 V(A t +6(j,i)+MtW-Mt-i(i)) Vte[L-2] 

+ d t+i Xl( A *+ i + &+i (*>.?) + Mt+iC?) - MtW) + m *CM) - XI m *+i 

3 3 j 

where we further define ^L-i(j) '■= 0. Obviously it takes 0(Lm 2 ) to compute all the gradients, 
and so is 
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